ANALYSIS OF PROSPER LOAN DATA by CHUNDI HIMAKIRAN KUMAR

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Prosper Loan Dataset: The original Prosper Loan Dataset contains information of loans cetgorized across 81 variables. For the purpose of this analysis we are subsetting this dataset to analyse the data pertaining to 16 variables namely ‘CreditGrade’,‘Term’,‘LoanStatus’,‘BorrowerAPR’, ‘BorrowerRate’,‘ListingCategory..numeric.’,‘BorrowerState’,‘Occupation’, ‘EmploymentStatus’,‘EmploymentStatusDuration’,‘IsBorrowerHomeowner’, ‘IncomeRange’,‘LoanNumber’,‘LoanOriginalAmount’,‘LoanOriginationDate’ and ‘LoanOriginationQuarter’.

Understanding the structure of the dataset

## 'data.frame':    113937 obs. of  16 variables:
##  $ CreditGrade              : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                     : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus               : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ BorrowerAPR              : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ ProsperScore             : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.: int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState            : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation               : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus         : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner      : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ IncomeRange              : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ LoanNumber               : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount       : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate      : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter   : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...

Understanding the range of values for each column:

CreditGrade:

## [1] ""   "A"  "AA" "B"  "C"  "D"  "E"  "HR" "NC"

Term in Months:

## [1] 12 60

Term in Months Distribution:

## 
##    12    36    60 
##  1614 87778 24545

*** Converting Term to factors ***

Loan Status:

##  [1] "Cancelled"              "Chargedoff"            
##  [3] "Completed"              "Current"               
##  [5] "Defaulted"              "FinalPaymentInProgress"
##  [7] "Past Due (1-15 days)"   "Past Due (16-30 days)" 
##  [9] "Past Due (31-60 days)"  "Past Due (61-90 days)" 
## [11] "Past Due (91-120 days)" "Past Due (>120 days)"

Loan Status Distribution:

## 
##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (1-15 days)  Past Due (16-30 days)  Past Due (31-60 days) 
##                    806                    265                    363 
##  Past Due (61-90 days) Past Due (91-120 days)   Past Due (>120 days) 
##                    313                    304                     16

Borrower APR:

## [1] 0.00653 0.51229

Loan Original Amount:

## [1]  1000 35000

Employment status duration in months:

## [1]   0 755

Prosper score for Loans:

## [1]  1 11
## 
##     1     2     3     4     5     6     7     8     9    10    11 
##   992  5766  7642 12595  9813 12278 10597 12053  6911  4750  1456

Univariate Plots Section

** Term distribution of loans **

##    12    36    60 
##  1614 87778 24545

It is observed that all the loans are either of 12,36 or 60 months duration. The maximum number of loans have a term of 36 months followed by 60 month term loans.

##           A    AA     B     C     D     E    HR    NC 
## 84984  3315  3509  4389  5649  5153  3289  3508   141

It is observed that max loans have no credit grade assigned. Among the rest ‘C’ Credit Grade loans are marginally higher than the rest grades.

##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (1-15 days)  Past Due (16-30 days)  Past Due (31-60 days) 
##                    806                    265                    363 
##  Past Due (61-90 days) Past Due (91-120 days)   Past Due (>120 days) 
##                    313                    304                     16

It is observed that approximately 5000 loans have defaulted status and maximum loans in the dataset are current. The pst due loans are comparitveley very less.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229      25

It is observed that maximum loans have been issued at an APR of 36%.

##                    Employed     Full-time Not available  Not employed 
##          2255         67322         26355          5347           835 
##         Other     Part-time       Retired Self-employed 
##          3806          1088           795          6134

It is observed that maximum loans have been issued to Employed people and very few loans have been issued to not employed and part time people.

##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##            621           7274          17337          32192          31050 
## $75,000-99,999  Not displayed   Not employed 
##          16916           7741            806

The histogram is a perfect bell curve and shows that max loans have been given to the middle income groups with $25K-49K and $50K-74.99K and the loans fall off on either side as the income rises and falls.

## $title
## [1] "LoanOriginalAmount Histogram"
## 
## $subtitle
## NULL
## 
## attr(,"class")
## [1] "labels"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

The most frequently sanctioned Loan amounts seems to be 4K$, 10K and 15K dollars. There are also loans sanctioned for 35K dollars.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   2.774   3.000  20.000

Maximum loans have been sanctioned under listing catgeory 1(Debt Consolidation) followd by 0(Not Available) and 7(Other). Very few loans have been sanctioned to 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans followed by 18 - Taxes, 19 - Vacation, 20 - Wedding Loans.

##          AK    AL    AR    AZ    CA    CO    CT    DC    DE    FL    GA 
##  5515   200  1679   855  1901 14717  2210  1627   382   300  6720  5008 
##    HI    IA    ID    IL    IN    KS    KY    LA    MA    MD    ME    MI 
##   409   186   599  5921  2078  1062   983   954  2242  2821   101  3593 
##    MN    MO    MS    MT    NC    ND    NE    NH    NJ    NM    NV    NY 
##  2318  2615   787   330  3084    52   674   551  3097   472  1090  6729 
##    OH    OK    OR    PA    RI    SC    SD    TN    TX    UT    VA    VT 
##  4197   971  1817  2972   435  1122   189  1737  6842   877  3278   207 
##    WA    WI    WV    WY 
##  3048  1842   391   150

Max lons have been sanctioned to people from state of California followed by approximatley equal number of loans to people from Florida, Illinois, New York and Texas.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   67.00   96.07  137.00  755.00    7625

It is observed that max loans have been given to people who have only spent a month or two in their employment status and there is a clear downward march of loans given out decreasing as the employment status duration increases.

##     1     2     3     4     5     6     7     8     9    10    11  NA's 
##   992  5766  7642 12595  9813 12278 10597 12053  6911  4750  1456 29084

It is observed that max loans have a prosper score of 4,6 and 8 while there are comparitevely very few loans having the worst score of 1 and best score of 11.

Univariate Analysis

Main features of interest in your dataset?

Based on the above analysis the PropserScore, BorrowedAPR and the Loan Amount seem to be features of interest

Other features in the dataset which will help support investigation into

the above feature(s) of interest?

IsBorrowerHomeowner, Term and EmploymentStatusDuration are the features which could support the investigation.

Unusual Distributions..

The employment status duration histogram was unexpected as it showed that the loans granted to people decreased as their duration in the respective employment increased. It perhaps could be attributed to the fact that max loans were given out to new employees upon getting a job within the first two months.

Bivariate Plots Section

We see that maximum loans have been given to non home owners as compared to home owners across all the prosper score ratings.

## 
##    12    36    60 
##  1614 87778 24545

There are only three values of term ie 12,36 and 60 months.We can see that the median BorrowerAPR is approximately the same for all three terms. But the range of BorrowerAPR values is much larger for 36 Months Term Loans than others.

As the Loan amount increases the Borrower APR moves to a range between 0.1 to 0.2 from 0.05 to 0.4. There seems to be a negative corelation between Borrower APR and Loan Amount.

We want to improve the above plot by coverting the employment status duration from days to months.

We now see that Employment status duration month wise has a similar range of APR for all months. However we also observe that many loans with varying values of employment status duration have an APR whci is 0.35-0.36 percentage points.

Here also we shall try to change the value of EmploymentStatus duration from days to months to revisit this graph

We now see that Employment status duration for varying montha has similar Loan Amount distribution. But we also notice that maximum loans in each value of Employment status duration have a Loan Amounts of 10K\(,15K\), 20K$ and 25K$.

We see that median Loan Amount increases with increase in the term distribution.

We see that as the PropserScore increases the Median LoanOriginalAmount also tends to increase.

The median Employment status duration does not seem to change much with the propser score.

We can see that the median Borrower APR decreases with increasing Prosper Score..Hence we would want to further analyze this relationship as under. We shall group the data by prosper score and try to find the Borrower APR mean for each group and then plot the mean vs the Prosper Score. To do this we shall use ‘dplyr’ package and derive a dataframe by grouping the existing dataframe by ProsperScore.

As expected we find a very close inverse linear relationship between ProsperScore and the mean Borrower APR. We shall also try and find a relationship between the ProsperScore and the Loan Amount sanctioned.

We see that there is an increasing linear relationship between the mean Loan Amount and the Prosper Score.

Bivariate Analysis

Relationships of Interest

By grouping the loans using the ProsperScore and then finding the BorrowerAPR mean of each group we were able to plot and see the relationship between the ProsperSCore and BorrowerAPR mean as well as ProsperScore and LoanOriginalAmount. We saw that highly rated loans had more Loan Amount Sanctioned and lesser Borrower APR.

Multivariate Plots Section

We see that across the prosper score ratings People who are not Home owners(red color) tend to have higher Borrower APR. We also see that as the propser score increases (gets better) the Borrower APR goes down.

We wanted to further understand the relationship between Loan Original Amount and Borrower APR. By colouring the data using the HomeOwner variable we see that maximum loans having APR from 0.05 to 0.4 have been given to the HomeOwners. Loans with Loan Amounts more than 0.4 are mostly given to People without home ownership.

We wanted to extend the analysis of ProsperScore and the LoanOriginalAmount using the home owner variable. From the graph above we see that People who are not home owners form the majority of loans that have lesser loan original amounts as compared to loans of home owners.Loans upward of 10K$ are mostly concentrated with Home owners. It is also seen that Loans with prosper score of 9 and above are predominantly of Home owners.

Multivariate Analysis

We have used the IsBorrowerHomeowner variable to extend our analysis between the ProsperScore , BorrowerAPR against the LoanOriginalAmount. The aim was to see if the variable IsBorrowerHomeOwner would give us any indication about the interplay of these three variables.From the above two graphs we can conclude that Loans given to Homeowners have better APR, Larger LoanAmounts and better prosper scores as compared to non home owners.


Final Plots and Summary

Plot One

Description One

The mean Loan Amount of loans grouped by ProsperScore has a clear approximate increasing linear relationship with the ProsperScore. It can be understood logically that highly rated loans have a larger Loan Amount than loans with poor rating.

Plot Two

Description Two

As an inverse of the previous graph we can see here that the APR mean falls with increasing Prosper Score. Again this can be understood in a way that the loans with higher Prosper SCore have a mean APR lower than the loans with poorer Prosper Scores.

Plot Three

Description Three

We see that across the prosper score ratings People who are not Home owners(red color) tend to have higher Borrower APR. We also see that as the propser score increases (gets better) the Borrower APR goes down.


Reflection

It was a difficult process to analyze this dataset. The intial difficulty lay in trying to guess the interesting variables for analysis from a total of 81 variables. The distributions of the variables helped understand the nature of the data and the distribution which in turn help guess the pair of variables that could seem to have some relationship between them. After narrowing down to the interesting relationship between Borrowed APR and Loan Amount it was decided to explore this relationship further using the ProsperScore variable. Finally we succeeded in establishing that both the Loan Amount and Borrowed APR varied in a linear fashion compared to the ProsperScore variable.

Further work can include fitting a model to Final Plots One and Two..